Abstract

Advances in bio-technology have made available massive amounts of functional, 
structural and genomic data for many biological sequences. This increased availability 
of heterogeneous biological data has resulted in biological applications where a multiple 
sequence alignment (msa) is required for aligning similar features, where a feature is 
described in structural, functional or evolutionary terms. In these applications, for a 
given set of sequences, depending on the feature of interest the optimal msa is likely 
to be different, and sequence similarity can only be used as a rough initial estimate on 
the accuracy of an msa. This has motivated the growth in template based heuristics 
that supplement the sequence information with evolutionary, structural and functional 
data and exploit feature similarity instead of sequence similarity to construct multiple 
sequence alignments that are biologically more accurate. However, current frameworks 
for designing template based heuristics do not allow the user to explicitly specify infor- 
mation that can help to classify features into types and associate weights signifying the 
relative importance of a feature with respect to other features, even though in many 
instances this additional information is readily available. This has resulted in the use 
of ad hoc measures and algorithms to define feature similarity and msa construction 
respectively. 

In this paper, we first provide a mechanism where as a part of the template infor- 
mation the user can explicitly specify for each feature, its type, and weight. The type 
is to classify the features into different categories based on their characteristics and the 
weight signifies the relative importance of a feature with respect to other features in 
that sequence. Second, we exploit the above information to define scoring models for 
pair-wise sequence alignment that assume segment conservation as opposed to single 
character (residue) conservation. Finally, we present a fast progressive alignment based
heuristic framework that constructs a global msa by first building, with exact methods, an
msa involving only the informative segments, and then stitching into it the alignments of
the non-informative segments, which are constructed using fast approximate methods.



Key words: Analysis of algorithms; Bioinformatics; Computational Biology; Multiple Se- 
quence Alignment; Template Based Heuristics 

1. Introduction 

A global multiple sequence alignment (msa) [7, 17, 29] of a set $S = \{S_1, S_2, \ldots, S_k\}$ of k
related protein sequences is a way of arranging the characters in S into a rectangular grid
of columns by introducing zero or more spaces into each sequence so that similar sequence
features occur in the same column, where a feature can be any relevant biological informa- 
tion like secondary/tertiary structure, function, domain decomposition, or homology to the 
common ancestor. The goal in attempting to construct a global msa is either to identify 
conserved features that may explain their functional, structural, evolutionary or phenotypic 
similarity, or identify mutations that may explain functional, structural, evolutionary or phe- 
notypic variability. 

Until recently, sequence information was the only information that was easily available for 
many proteins. So, the measures that were used to evaluate the quality (accuracy) of a msa 
were mostly based on sequence similarity. The sum of pairs score (SP-score) and Tree score 
were two such measures that were widely used. For both these measures, the computation 
of an optimal msa is known to be NP-Complete [54]. So, in practice most of the focus is 
on designing fast approximation algorithms and heuristics. From the perspective of approximation
algorithms, polynomial time constant-factor approximation algorithms are known for the
SP-score [17,55] and polynomial time approximation schemes (PTAS) [55] are known for the
Tree score. However, in practice, these approximation algorithms have large run-times that
make them impractical even for moderately sized problem instances. From the perspective
of heuristics, most heuristics are based on progressive alignment [18, 20, 48, 49, 51], iterative 
alignment [10, 11, 12, 14, 15, 19, 24, 47], branch and bound [45], genetic algorithms [36, 37], 
simulated annealing [26] or on Hidden Markov Modeling (HMM) [8, 9, 21]. For an extensive 
review of the various heuristics for msa construction, we refer the reader to the excellent survey
articles of Kemena and Notredame [25], Notredame [33, 34, 35], Edgar and Batzoglou [12],
Gotoh [15], Wallace et al. [52], and Blackshields et al. [5].

In heuristics based on progressive alignment, the msa is constructed by first computing 
pair-wise sequence distances using optimal pair-wise global alignment scores. Second, a 
clustering algorithm (UPGMA or NJ [46]) uses these pair-wise sequence distances to construct
a rooted binary tree, usually referred to as the guide tree. Finally, an agglomerative
algorithm uses this guide tree to progressively align sequences a pair at a time to construct
an msa. The pair-wise global alignment scores are usually computed using a substitution
matrix and a gap penalty scheme that is based on sequence similarity. ClustalW [51] was
among the first widely used progressive alignment tools, and many current
progressive aligners are based on it. In this paper, our focus is on heuristics that are based on
progressive alignment mainly because in this method the computation of pair-wise sequence 
distances, guide tree and the choice of agglomerative algorithm for progressively pair-wise 
aligning sequences can be essentially split into three independent steps. This helps to pro- 
vide a flexible algorithmic framework for designing simple parameterized greedy algorithms 
that are computationally scalable and whose parameters can be tuned easily to improve their
accuracy. In addition, the alignments obtained through this approach are usually a good 
starting point for other popular approaches like iterative, branch and bound, and HMM. 
However, the progressive aligners because of their greedy approach commit mistakes early 
in the alignment process that are usually very hard to correct even when using sophisticated 
iterative aligners. This problem can be addressed if, for every pair of sites, the pair-wise
scoring scheme incorporates the frequency with which the residues at these sites are aligned
in alignments involving the other sequences in S. However, incorporating
this information for all pairs of sites based on all sequences in S is computationally infeasible.

The consistency based heuristics [6, 10, 11, 24, 28, 38, 39, 40, 42, 43, 47] tackle this problem 
by incorporating a larger fraction of this information at a reasonable computational cost as 
follows: The score for aligning residues at a pair of sites is estimated from a collection of pair-wise
residue alignments called the library. The library consists of pair-wise alignments
whose residue alignment characteristics are implicitly assumed to be similar to an optimal 
msa or a reference alignment that was constructed using sequence independent methods. 
For a given library, any pair of residues receives an alignment score equal to the number of 
times these two residues have been found aligned either directly or indirectly through a third 
residue. Consistency based progressive aligners generally construct msas that are more
accurate than those of pure progressive aligners like ClustalW. However, it is not clear how
to construct a library of alignments whose residue alignment characteristics are guaranteed
to be similar to an optimal msa. In addition, the increased accuracy of consistency
based aligners comes at a computational cost that is on average k times that of a pure
progressive aligner. T-Coffee [38], ProbCons [6], MAFFT [24], M-Coffee [53], MUMMALS 
[41], EXPRESSO [2], PRALINE [42], T-Lara [4] are some of the widely used consistency 
based aligners. 
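The library-based score described above can be illustrated with a minimal sketch. It is a simplified, unweighted illustration and not the scoring used by any particular aligner: the function name `library_score` and the representation of a residue as a `(sequence_name, position)` tuple are assumptions made for this example.

```python
from collections import defaultdict


def library_score(library, x, y):
    """Count how often residues x and y are aligned in the library,
    either directly or indirectly through a third residue.

    library: iterable of aligned residue pairs, each residue a
             (sequence_name, position) tuple (hypothetical representation).
    """
    aligned = defaultdict(set)   # residue -> residues it is aligned with
    direct = 0
    for p, q in library:
        aligned[p].add(q)
        aligned[q].add(p)
        if {p, q} == {x, y}:
            direct += 1
    # A third residue z supports (x, y) if both x-z and z-y are in the library.
    indirect = len((aligned[x] & aligned[y]) - {x, y})
    return direct + indirect
```

Evaluating every residue pair against every third sequence in this way is the source of the extra factor of roughly k in running time mentioned above.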

Currently, advances in bio-technology have made available massive amounts of functional,
structural and genomic data for many biological sequences. This increased availability of
heterogeneous biological data has resulted in biological applications where an msa is required
for aligning similar features, where a feature is described in structural, functional or evolu- 
tionary terms. In these applications, for a given set of sequences, depending on the feature of 
interest the optimal msa is likely to be different, and sequence similarity can only be used as 
a rough initial estimate on the accuracy of an msa. In addition, from evolutionary studies we 
know that structure and function of biological sequences are usually more conserved than the 
sequence itself. This has motivated the growth in template based heuristics [50] that supple- 
ment the sequence information with evolutionary, structural and functional data and exploit 
feature similarity instead of sequence similarity to construct multiple sequence alignments 
that are biologically more accurate. In these methods, each sequence is associated with a 
template, where a template can either be a 3-D structure, a profile or prediction of any kind. 
Once a template is mapped onto a sequence, its information content can be used to guide 
the sequence alignment in a sequence independent fashion. Depending on the nature of the 
template one refers to its usage as structural extension or homology extension. Structural 
extension takes advantage of the increasing number of sequences with an experimentally 
characterized homolog in the PDB database, whereas homology extension uses profiles. 3-D
Coffee [3], EXPRESSO [3], PROMALS [42, 44] and PRALINE [47] are some widely used aligners
that employ template based methods. For more details about template
based methods we refer the reader to Kemena and Notredame [25] and Notredame [34].

In template based methods, we can view each template once mapped to a sequence as essen- 
tially partitioning the sequence into segments, where each segment corresponds to a feature 
described by the template. Then, we construct a msa by essentially aligning segments that 
share similar features. The current frameworks for describing templates do not allow the 
user to explicitly specify information that can help (i) classify features into types and (ii) 
associate a weight signifying the relative importance of a feature with respect to other fea- 
tures, even though in many instances this additional information is readily available. This 
has resulted in the use of ad hoc measures and algorithms to define feature similarity and 
msa construction respectively. 

In this paper, we 

- provide a mechanism by which, as part of the template information, the user can explicitly
specify for each feature its type and weight. The type classifies the features into
different categories based on their characteristics, and the weight signifies the relative
importance of a feature with respect to other features in that sequence.

- define scoring models for pair-wise sequence alignment that assume segment conser- 
vation as opposed to single character (residue) conservation. Our scoring schemes for 
aligning pairs of segments are based on segment type, segment weight, information 
content of an optimal local alignment involving that segment pair, and its supporting 
context. This is an attempt to define scoring schemes that evaluate a pair-wise global 
alignment through information content of a global segment alignment, where segments 
correspond to features within sequences. For example, in a structurally correct align- 
ment the focus is on aligning residues that play a similar role in the 3D structure of 
the sequences, whereas a correct alignment from an evolutionary viewpoint focuses 
on aligning two residues that share a similar relation to their closest common ances- 
tor, and in a functionally correct alignment the focus is on aligning residues that are 
known to be responsible for the same function. The supporting context consists of a set
of sequences that are known to belong to the same family (i.e. share similar structure,
function or homology to a common ancestor) as the given sequence pair and can help 
determine to what extent the alignment of the features in that pair-wise alignment is 
consistent with the alignment of these features with other sequences in the family. 

- present a fast progressive alignment based heuristic that constructs a global
msa by first classifying segments as informative or non-informative based
on their information content, determined using segment scoring matrices. Then, using
exact methods, we construct a global msa involving only the informative segments.
Finally, using approximate methods, we construct the alignment of the non-informative
segments and stitch it into the alignment of the informative segments.

Remark: The statistical theory for evaluating alignments in terms of their information content
was developed for local alignments by Karlin and Altschul [23]. However, their theory does not
extend to the case of global alignments. Pair-HMMs provide a framework for statistical
analysis of pair-wise global alignments for complex scoring schemes using standard methods
like Baum-Welch and Viterbi training. However, determining the right set of parameters
for optimal statistical support is highly non-trivial and involves dynamic programming
algorithms with computational complexity that is quadratic in the length of the given sequences.

The rest of this paper is structured as follows. In Section 2, we define the problem and
introduce the relevant terms and notations to define our segment scoring schemes and heuris- 
tics. In Section 3, we present our segment scoring schemes. In Section 4, we present our 
heuristics, in Section 5, we describe our experimental set-up and summarize our preliminary 
experimental results, and in Section 6, we present our conclusions and future work. 

2. Preliminaries 

In this section, we first define the problem of msa construction for a given set of sequences
and their segment decompositions, where each segment is classified into one of many types 
and is associated with a weight reflecting its importance relative to other segments within 
that sequence. Then, we introduce some basic terms and definitions that are required for 
defining our scoring models and heuristics. 

2.1 Problem Definition 

Let $S = \{S_1, \ldots, S_k\}$ be a set of k related protein sequences each of length n. For $i \in [1..k]$,
let $B_i = \{B_i^1, \ldots, B_i^{n_i}\}$ be the decomposition of $S_i$ into $n_i$ segments. Each segment $s \in B_i$ is
classified into one of many types based on the type of features that are known/predicted to
be present in that segment, and is associated with a non-negative real number weight that
reflects the importance of the feature associated with that segment relative to other features
in that sequence. That is, each segment $s \in B_i$, $i \in [1..k]$, is associated with a type $type(s)$
and a non-negative real number weight $weight(s)$.

Example: If the sequences in S are partitioned into segments based on their predicted
secondary structure, each segment is classified into one of three types (helix, strand or coil)
and is associated with a weight in the interval [1, 10] that reflects the confidence in
its secondary structure classification.

Given a set S of k biologically related sequences, their decomposition into segments, and
the type and weight associated with each of these segments, our goal is to design fast pro- 
gressive alignment based heuristics that exploit the information content in these segments 
to build a biologically significant multiple sequence alignment.

2.2 Basic Terms and Definitions



Now we introduce some terms and definitions that will be necessary for defining our segment
scoring models and heuristics.

Definitions 2.1 For $i \in [1..k]$, we define

- $B_i^{inf} = \{s \in B_i : weight(s) \geq a\}$ to be the set of segments in $S_i$ whose weights are greater
than or equal to $a$, where $a$ is a non-negative user specified real number parameter.
We refer to the segments in $B_i^{inf}$ as the informative segments of $S_i$;

- $S_i^{inf}$ to be the subsequence of $S_i$ obtained by concatenating the segments in $B_i^{inf}$ in
the order in which they appear in $S_i$. We refer to this subsequence as the informative
sequence of $S_i$.
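As a minimal sketch of Definitions 2.1, the informative segments and the informative sequence can be computed with a simple filter. The `Segment` record, its field names and the half-open `[start, end)` coordinates are assumptions made for illustration and are not part of the definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    start: int     # 0-based start offset in the parent sequence (assumed layout)
    end: int       # exclusive end offset
    type: str      # e.g. "helix", "strand" or "coil"
    weight: float  # non-negative importance of the feature


def informative_segments(segments, a):
    """Segments with weight >= a, in the order they appear in the sequence."""
    kept = [s for s in segments if s.weight >= a]
    return sorted(kept, key=lambda s: s.start)


def informative_sequence(sequence, segments, a):
    """Concatenation of the informative segments of `sequence`."""
    return "".join(sequence[s.start:s.end] for s in informative_segments(segments, a))
```

For example, with three segments of weights 8, 3 and 6 and a threshold of 6, only the first and third segments are kept and concatenated.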

Definition 2.2 For a pair of segments $s \in B_i^{inf}$ and $t \in B_j^{inf}$ of the same type, $i \neq j \in [1..k]$,
we define $L_H(s,t)$ to be the local alignment between $s$ and $t$ constructed using heuristic
$H$ and the BLOSUM62 scoring matrix, and $SEG_H(s,t)$ to be the bit score corresponding to $L_H(s,t)$.

Definitions 2.3 For $i \neq j \in [1..k]$, we define

- $a_{i,j}$ to be a real number in the interval [0, 2] that reflects the level of divergence between
$S_i$ and $S_j$. We estimate the level of divergence between $S_i$ and $S_j$ using the bit score of a
local alignment between $S_i^{inf}$ and $S_j^{inf}$ constructed using heuristic $H$ and the BLOSUM62
scoring matrix;

- $c : [0, 2] \to \mathbb{R}^+$ to be a function that computes, for a given level of divergence, the
information threshold for an alignment to be informative.

Definitions 2.4 For a segment $s \in B_i^{inf}$ and $j \neq i \in [1..k]$, we define

- $Neighbor_j(s) = \{t \in B_j^{inf} : type(t) = type(s) \wedge SEG_H(s,t) \geq c(a_{i,j}) \cdot |s|\}$ to be the set
of informative segments $t$ in $S_j$ of the same type as $s$ with the bit score of a local alignment
between $s$ and $t$ greater than or equal to $c(a_{i,j}) \cdot |s|$. We refer to the segments in
$Neighbor_j(s)$ as the neighbors of $s$ in $S_j$.

- $Closest\text{-}neighbor_j(s) = \{u' \in B_j^{inf} : SEG_H(s,u') = \max_{t \in Neighbor_j(s)} SEG_H(s,t)\}$ to be
the neighbor of $s$ in $S_j$ that maximizes the bit score of a pair-wise local alignment with
$s$. We refer to such a segment as the closest neighbor of $s$ in $S_j$.

- $Neighborhood(s) = \bigcup_{j \in [1..k]} Closest\text{-}neighbor_j(s)$ to be the set consisting of a closest
neighbor of $s$ from each sequence in S.
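The neighbor definitions above can be sketched as follows. This is an illustration under stated assumptions: a segment is represented as a `(type, residues)` tuple, `seg_score` is a caller-supplied stand-in for the bit score of the local alignment, and `threshold_per_residue` stands for the value of c at the divergence level of the two sequences. None of these names come from the paper.

```python
def neighbors(s, candidates, seg_score, threshold_per_residue):
    """Same-type candidates t whose score with s is at least threshold * |s|."""
    s_type, s_residues = s
    cutoff = threshold_per_residue * len(s_residues)
    return [t for t in candidates
            if t[0] == s_type and seg_score(s, t) >= cutoff]


def closest_neighbor(s, candidates, seg_score, threshold_per_residue):
    """A neighbor of s with the maximum score, or None if s has no neighbor."""
    found = neighbors(s, candidates, seg_score, threshold_per_residue)
    return max(found, key=lambda t: seg_score(s, t), default=None)


def neighborhood(s, candidates_by_sequence, seg_score, threshold_per_residue):
    """One closest neighbor of s from each other sequence that has one."""
    closest = (closest_neighbor(s, cands, seg_score, threshold_per_residue)
               for cands in candidates_by_sequence)
    return [t for t in closest if t is not None]
```

Returning `None` for a sequence with no neighbor is a design choice of this sketch; the definition itself describes a possibly empty set.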

Definitions 2.5 For $i \in [1..k]$, we define

- $B_i^{nei} = \{s \in B_i : s$ is a neighbor of some segment in $S \setminus S_i\}$.

- $S_i^{nei}$ to denote the subsequence obtained by concatenating the segments in $B_i^{nei}$ in the
order in which they appear in $S_i$. We refer to this subsequence as the neighbor sequence
of $S_i$.

Definitions 2.6 For each pair of segments $s \in S_i^{nei}$ and $t \in S_j^{nei}$ of the same type from
distinct sequences in S, and $l \neq i, j \in [1..k]$, we define

- $Mutual\text{-}neighbor_{S_l}(s,t) = \{u \in B_l^{nei} : u \in Neighbor_l(s) \cap Neighbor_l(t)\}$ to be the
segments in $S_l$ that are neighbors of both $s$ and $t$.

- $Closest\text{-}mutual\text{-}neighbor_l(s,t) = \{u' \in B_l^{nei} : SEG_H(s,u') + SEG_H(t,u') =
\max_{u \in Mutual\text{-}neighbor_{S_l}(s,t)} (SEG_H(s,u) + SEG_H(t,u))\}$ to be the mutual neighbor $u$ of $s$
and $t$ in $S_l$ that maximizes $SEG_H(s,u) + SEG_H(t,u)$. We refer to such a segment as
the closest mutual neighbor of $s$ and $t$ in $S_l$.

- $Mutual\text{-}neighborhood(s,t) = \bigcup_{j \in [1..k]} Closest\text{-}mutual\text{-}neighbor_j(s,t)$ to be the
set consisting of a closest mutual neighbor of $s$ and $t$ from each sequence in S.
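A minimal sketch of the mutual neighbor definitions is given below. It assumes that the neighbor relation is available as a predicate `is_neighbor(x, u)` and that `seg_score` stands in for the bit score of the local alignment; both are hypothetical callables introduced for illustration.

```python
def closest_mutual_neighbor(s, t, candidates, seg_score, is_neighbor):
    """Among candidates that are neighbors of both s and t, return one that
    maximizes seg_score(s, u) + seg_score(t, u); None if there is none."""
    mutual = [u for u in candidates if is_neighbor(s, u) and is_neighbor(t, u)]
    return max(mutual,
               key=lambda u: seg_score(s, u) + seg_score(t, u),
               default=None)


def mutual_neighborhood(s, t, candidates_by_sequence, seg_score, is_neighbor):
    """One closest mutual neighbor of s and t per third sequence that has one."""
    hood = []
    for candidates in candidates_by_sequence:
        u = closest_mutual_neighbor(s, t, candidates, seg_score, is_neighbor)
        if u is not None:
            hood.append(u)
    return hood
```

The size of the returned list is the number of third sequences that support aligning the two segments, which is the quantity the consistency scores of Section 3 multiply by.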

3. Scoring Models for Global Segment Alignment 

In this section, we define scoring models for pair-wise segment alignment of sequences. We 
classify segments into informative and non-informative based on their weight and construct 
segment scoring matrix entries only for informative segments. Restricting the segment scor- 
ing matrix entries to only informative segments helps to significantly reduce the computa- 
tional time of our heuristics with minimal impact on alignment accuracy. In Section 3.1, 
we introduce scoring schemes for aligning pairs of informative segments. In Section 3.2, we 
introduce scoring schemes for aligning a segment with a gap.

3.1 Scoring Schemes for Aligning Pairs of Segments

We now introduce the following three scoring schemes for aligning a pair $s, t$ of informative
segments of the same type: (i) Progressive scoring; (ii) Linear Consistency scoring, and (iii)
Quadratic Consistency scoring.

Progressive scoring: $SCORE(s,t) = SEG_H(s,t)$. In this scheme, we only make use
of the information content of a pair-wise local alignment between segments $s$ and $t$
constructed using heuristic $H$ and the BLOSUM62 scoring matrix.

Linear consistency scoring:

$SCORE(s,t) = |Mutual\text{-}neighborhood(s,t)| \cdot \sum_{u \in Mutual\text{-}neighborhood(s,t)} (SEG_H(s,u) + SEG_H(t,u))$.
In this scheme, we make use of the information from both (i) pair-wise local
alignment between $s$ and $t$, and (ii) pair-wise alignments involving the segments $s$ and $t$ with
segments in their mutual neighborhood.

Quadratic consistency scoring:

$SCORE(s,t) = |Mutual\text{-}neighborhood(s,t)|^2 \cdot \sum_{u \in Mutual\text{-}neighborhood(s,t)} (SEG_H(s,u) + SEG_H(t,u))$.
In this scheme, the information obtained through the alignment of two conserved
segments of two diverging sequences is weighed more than the information obtained
through the alignment of two non-conserved segments of two closely related sequences.
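The three schemes differ only in how the support from the mutual neighborhood is weighted, which the following sketch makes explicit. The function names, the `power` parameter and the `seg_score` callable (a stand-in for the bit score of the local alignment) are assumptions made for illustration.

```python
def progressive_score(s, t, seg_score):
    """Progressive scoring: the pair-wise segment score alone."""
    return seg_score(s, t)


def consistency_score(s, t, mutual_neighborhood, seg_score, power=1):
    """Consistency scoring over the mutual neighborhood of s and t.

    power=1 gives the linear scheme, power=2 the quadratic scheme.
    """
    support = sum(seg_score(s, u) + seg_score(t, u) for u in mutual_neighborhood)
    return (len(mutual_neighborhood) ** power) * support
```

With a mutual neighborhood of size 2 and a summed support of 28, the linear score is 56 and the quadratic score is 112. An empty mutual neighborhood yields a score of 0 under both consistency schemes.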

3.2 Scoring Schemes for Aligning a Segment with a Gap 

We now introduce the following two scoring schemes for aligning an informative segment 
with a gap: (i) Zero gap penalty, and (ii) Maximum gap penalty. 

Zero gap penalty: $SCORE(s, -) = 0$. In this scheme, we do not penalize the deletion
of any informative segment.

Maximum gap penalty: $SCORE(s, -) = \max_{t \in Neighborhood(s)} SEG_H(s,t)$. In this scheme,
the gap penalty of $s \in B_i^{inf}$ is based on the informative segment $t$ in $S \setminus S_i$ of the same type that
maximizes the bit score of a pair-wise local alignment between $s$ and $t$.

4. Heuristics for msa Construction



We now present a generic framework for designing template based fast progressive alignment 
heuristics that construct global msa as follows: 

(i) construct $DIST^{nei}$, a matrix of pair-wise sequence distances, based on scores of pair-wise
global segment alignments involving only the informative segments;

(ii) construct a guide tree $G^{nei}$ from $DIST^{nei}$ using the NJ algorithm, and build $MSA^{nei}$, an
msa of the informative segments, by progressively pair-wise segment aligning sequences
consistent with $G^{nei}$;

(iii) construct the pair-wise global alignment of the residues in non-informative segments
using fast approximate methods and stitch them back into $MSA^{nei}$.

In Section 4.1, we describe the components of our heuristic, and in Section 4.2, we present Heuristic A.

4.1 Description of Our Heuristic

Construction of pair-wise sequence distances: We now describe how we compute the 
pair-wise sequence distances for each pair of sequences in S. 

Definitions 4.1 For $i, j \in [1..k]$, we define

- a global segment alignment between two sequences $S_i$ and $S_j$ to be an alignment where
a segment in $S_i$ is either aligned to a gap or to another segment in $S_j$ of the same type;

- $G^{nei}(i,j)$ to be the optimal global segment alignment between $S_i^{nei}$ and $S_j^{nei}$ constructed
using the segment scoring matrix SCORE;

- $DIST^{nei}(i,j)$ to be the score corresponding to $G^{nei}(i,j)$.

Notice that if each segment consists of a single amino acid then the global segment alignment 
is the same as a traditional global alignment. In this case, the traditional scoring matrices 
can be used to score alignments between segments. Otherwise, one needs to determine an
appropriate segment scoring matrix and then use the Needleman-Wunsch [32] dynamic program
to construct an optimal global pair-wise segment alignment.
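A global segment alignment can be sketched as a Needleman-Wunsch style dynamic program whose alphabet is segments rather than residues. The sketch below rests on several assumptions: a segment is a `(type, residues)` tuple, `pair_score` and `gap_penalty` are caller-supplied callables standing in for the segment scoring matrix, and the gap value is subtracted from the score. The sign convention in particular is an assumption of this example.

```python
def global_segment_alignment(xs, ys, pair_score, gap_penalty):
    """Optimal global alignment of two segment lists.

    Segments may only be paired when their types match; otherwise each is
    aligned to a gap. Returns (score, alignment), where alignment is a list
    of (x_segment_or_None, y_segment_or_None) pairs.
    """
    n, m = len(xs), len(ys)
    dp = [[float("-inf")] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    dp[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            moves = []
            if i and j and xs[i - 1][0] == ys[j - 1][0]:
                moves.append((dp[i - 1][j - 1] + pair_score(xs[i - 1], ys[j - 1]), "pair"))
            if i:
                moves.append((dp[i - 1][j] - gap_penalty(xs[i - 1]), "gap_in_y"))
            if j:
                moves.append((dp[i][j - 1] - gap_penalty(ys[j - 1]), "gap_in_x"))
            if moves:
                dp[i][j], back[i][j] = max(moves, key=lambda mv: mv[0])
    # Trace back from the bottom-right corner to recover the alignment.
    alignment, i, j = [], n, m
    while i or j:
        move = back[i][j]
        if move == "pair":
            alignment.append((xs[i - 1], ys[j - 1]))
            i, j = i - 1, j - 1
        elif move == "gap_in_y":
            alignment.append((xs[i - 1], None))
            i -= 1
        else:
            alignment.append((None, ys[j - 1]))
            j -= 1
    alignment.reverse()
    return dp[n][m], alignment
```

The table has one cell per pair of segment prefixes, so the cost is quadratic in the number of informative segments rather than in the number of residues.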

Construction of Guide Tree and msa of Informative Segments: We construct the
guide tree $G^{nei}$ from the pair-wise sequence distance matrix $DIST^{nei}$ using the Neighbor Joining
(NJ) algorithm. Then, we construct $M^{nei}$, the msa of the informative segments of the sequences in S,
by progressively pair-wise globally segment aligning the sequences $S_1^{nei}, S_2^{nei}, \ldots, S_k^{nei}$ consistent
with $G^{nei}$.

Stitching the sites in non-informative segments into msa of informative sequences:
We now describe, for each pair of sequences $S_i$ and $S_j$ that were progressively aligned
while constructing $M^{nei}$, how we stitch the alignment of the sites in $S_i$ and $S_j$ that were either
in non-informative segments or in non-aligned portions of informative segments into $M^{nei}$.
First, we introduce some necessary definitions.

Definitions 4.2 For a pair of informative segments $s \in B_i^{inf}$ and $t \in B_j^{inf}$ of the same type,
let $L_H(s,t)$ be the local alignment of $s$ and $t$ constructed using heuristic $H$ and the BLOSUM62
matrix. We now define

- $PREFIX_s(L_H(s,t))$ to be the prefix of segment $s$ that is not part of the local alignment
$L_H(s,t)$;

- $SUFFIX_s(L_H(s,t))$ to be the suffix of segment $s$ that is not part of the local alignment
$L_H(s,t)$.

Let $G^{nei}(i,j)$, $i \neq j \in [1..k]$, denote an optimal global segment alignment between the sequences
$S_i^{nei}$ and $S_j^{nei}$ constructed using the segment scoring matrix SCORE. For $G^{nei}(i,j)$, we say
a segment $s \in S_i$ is a matched segment if in $G^{nei}(i,j)$ it is aligned to a segment $t \in S_j^{nei}$;
otherwise it is an unmatched segment. We now present a procedure Stitch that stitches the
alignment between the sites in $S_i$ that occur between any two consecutive matched segments
$s$ and $s'$ and the sites in $S_j$ that occur between the corresponding matched segments $t$ and $t'$
into $G^{nei}(i,j)$.

Procedure Stitch($s$, $s'$):

- Let $p, p'$ ($q, q'$) be the respective indices of segments $s, s'$ ($t, t'$) in $S_i$ ($S_j$);

- Let $A = SUFFIX_i(B_i^p, B_j^q) \cup \bigcup_{l=p+1}^{p'-1} B_i^l \cup PREFIX_i(B_i^{p'}, B_j^{q'})$ be the sequence of sites
in $S_i$ that are either in non-informative segments or unaligned portions of informative
segments in $G^{nei}(i,j)$;

- Let $B = SUFFIX_j(B_i^p, B_j^q) \cup \bigcup_{l=q+1}^{q'-1} B_j^l \cup PREFIX_j(B_i^{p'}, B_j^{q'})$ be the sequence of sites in
$S_j$ that are either in non-informative segments or unaligned portions of informative
segments in $G^{nei}(i,j)$;

- Globally align segments $A$ and $B$ using the BLOSUM62 scoring matrix and any fast linear
time heuristic and then insert this alignment between segments $s$ and $s'$ in $G^{nei}(i,j)$.

4.2 Heuristic A(a, H, c)

Parameters:

(1) a: a non-negative real number;

(2) H: an algorithm/heuristic for pair-wise local alignment of sequences;

(3) c: a function that maps any given level of divergence in the interval [0, 2] to
the information threshold for an alignment to be informative.

Inputs: 

(1) $S = \{S_1, \ldots, S_k\}$: the set of k input sequences;

(2) $B = \{B_1, \ldots, B_k\}$: the set consisting of the segment decompositions of the
sequences in S, where each segment $s$ is associated with a type $type(s)$ and weight
$weight(s)$;

Main Heuristic

(1) For $i \in [1..k]$, construct $B_i^{inf} = \{s \in B_i : weight(s) \geq a\}$ and $S_i^{inf}$, the sequence
of informative segments.

(2) For each pair of informative segments $s \in B_i^{inf}$ and $t \in B_j^{inf}$ of the same type,
using heuristic $H$ and the BLOSUM62 scoring matrix construct $L_H(s,t)$ and compute
$SEG_H(s,t)$.

(3) For $i \neq j \in [1..k]$, set $a_{i,j}$ to the bit score per unit length corresponding to
$L_H(S_i^{inf}, S_j^{inf})$, the local alignment between $S_i^{inf}$ and $S_j^{inf}$ constructed using
heuristic $H$ and the BLOSUM62 scoring matrix.

(4) For each informative segment $s \in B_i^{inf}$ and $j \neq i \in [1..k]$, compute the following:

(i) $Neighbor_j(s) = \{t \in B_j^{inf} : type(t) = type(s) \wedge SEG_H(s,t) \geq c(a_{i,j}) \cdot |s|\}$.

(ii) $Closest\text{-}neighbor_j(s) = \{u' \in B_j^{inf} : SEG_H(s,u') = \max_{t \in Neighbor_j(s)} SEG_H(s,t)\}$.

(iii) $Neighborhood(s) = \bigcup_{j \in [1..k]} Closest\text{-}neighbor_j(s)$.

(5) For each sequence $S_i \in S$, $i \in [1..k]$, construct $B_i^{nei} = \{s \in B_i : s$ is a neighbor of
some segment in $S \setminus S_i\}$ and $S_i^{nei}$, the neighbor sequence of $S_i$.

(6) For each pair of segments $s \in S_i^{nei}$ and $t \in S_j^{nei}$ from distinct sequences and
$l \neq i, j \in [1..k]$, compute

(i) $Mutual\text{-}neighbor_{S_l}(s,t) = \{u \in B_l^{nei} : u \in Neighbor_l(s) \cap Neighbor_l(t)\}$.

(ii) $Closest\text{-}mutual\text{-}neighbor_l(s,t) = \{u' \in B_l^{nei} : SEG_H(s,u') + SEG_H(t,u') =
\max_{u \in Mutual\text{-}neighbor_{S_l}(s,t)} (SEG_H(s,u) + SEG_H(t,u))\}$.

(iii) $Mutual\text{-}neighborhood(s,t) = \bigcup_{j \in [1..k]} Closest\text{-}mutual\text{-}neighbor_j(s,t)$.

(7) For each segment $s \in B_i^{nei}$, $i \in [1..k]$, compute $SCORE(s, -)$.

(8) For each pair of segments $s \in B_i^{nei}$ and $t \in B_j^{nei}$ of the same type, compute
$SCORE(s,t)$.

(9) For $i \neq j \in [1..k]$, compute $DIST^{nei}(i,j)$ by globally segment aligning $S_i^{nei}$ and
$S_j^{nei}$ using the Needleman-Wunsch dynamic program and the segment scoring matrix
SCORE.

(10) We now construct the msa of S as follows:

(i) Construct the guide tree $T^{nei}$ from $DIST^{nei}$ using the Neighbor Joining (NJ)
algorithm.

(ii) Construct $M^{nei}$ by progressively globally segment aligning the sequences
$S_1^{nei}, \ldots, S_k^{nei}$ a pair at a time consistent with $T^{nei}$.

(iii) For each pair of sequences $S_i^{nei}$ and $S_j^{nei}$ that were progressively aligned while
constructing $M^{nei}$,

- Let $G^{nei}(i,j)$ denote the global segment alignment of $S_i^{nei}$ and $S_j^{nei}$;

- For each pair $s, s'$ of consecutive matched segments of $S_i^{nei}$ in $G^{nei}(i,j)$
(where $t, t'$ are the corresponding matched segments of $S_j^{nei}$), use procedure
Stitch to stitch the alignment between the sites in $S_i$ that occur between
$s$ and $s'$ and the sites in $S_j$ that occur between $t$ and $t'$ into
$G^{nei}(i,j)$.

5. Experimental Results



In this section, we first describe our experimental set-up, then we describe how we evaluate 
the performance of our heuristic, and finally we summarize our preliminary experimental 
results. 

5.1 Experimental Set-up 

Our computational experiments have been set-up with the focus on analyzing the perfor- 
mance of our heuristics for sequences from protein families in the PFAM [13] database for 
which (i) accurate reference alignments were available either through structural aligners or 
through other sequence independent biological methods, and (ii) annotations describing the 
salient biological features were available for each sequence. We chose 12 sets of sequences 
ranging from 5 to 23 sequences with sequence similarity ranging from 20% to 80%. For these 
sequences, we used PSIPRED [22], a widely used structure prediction tool, to partition each 
sequence into segments based on their secondary structure characteristics. PSIPRED classifies
each segment into one of three types (helix, strand or coil) and associates with it a
weight in the interval [1, 10] reflecting the confidence in its partitioning and classification.
Then, for these sets of sequences, we construct an msa by using our heuristic A(a, H, c), where
a is a non-negative real number parameter for classifying segments based on their weights
into informative and non-informative segments, H is an algorithm/heuristic for pair-wise
local alignment of segments, and c is a function that maps any given level of divergence
in the interval [0, 2] to the information threshold for an alignment to be informative. In our
experiments, we have set a to be 6. That is, a segment is considered to be informative if
its average segment weight exceeds 6 (i.e. exceeds a) and its length is at least 5. In addition, if
two informative segments of the same type are separated by less than 4 residues, we merge
the two segments with the intervening residues into a single informative segment. We set
H, the algorithm/heuristic for local alignment, to be BLASTP¹ [1, 30] with slight modifications
to handle alignments involving short sequences. We defined the function c based
on the average bit scores of BLOSUM matrices corresponding to different levels of sequence
divergence.
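
The selection and merging rule for informative segments described above can be sketched as
follows. This is a minimal illustration, not the implementation used in the experiments. It
assumes that the secondary structure prediction is already available as a per-residue state
string (H, E or C) with a per-residue confidence list; parsing PSIPRED output is not shown.
The names Segment, partition and informative_segments are hypothetical, and recomputing the
weight of a merged segment as the average over all residues it spans is our assumption.

```python
from dataclasses import dataclass
from itertools import groupby
from typing import List


@dataclass
class Segment:
    kind: str      # 'H' (helix), 'E' (strand) or 'C' (coil)
    start: int     # index of the first residue (0-based, inclusive)
    end: int       # index one past the last residue (exclusive)
    weight: float  # average per-residue confidence over the segment


def partition(states: str, conf: List[float]) -> List[Segment]:
    """Split a per-residue state string into maximal runs of one type."""
    segments, pos = [], 0
    for kind, run in groupby(states):
        n = len(list(run))
        mean = sum(conf[pos:pos + n]) / n
        segments.append(Segment(kind, pos, pos + n, mean))
        pos += n
    return segments


def informative_segments(states: str, conf: List[float], a: float = 6.0,
                         min_len: int = 5, max_gap: int = 4) -> List[Segment]:
    """Keep segments with average weight > a and length >= min_len, then
    merge same-type neighbours separated by fewer than max_gap residues."""
    keep = [s for s in partition(states, conf)
            if s.weight > a and s.end - s.start >= min_len]
    merged: List[Segment] = []
    for s in keep:
        prev = merged[-1] if merged else None
        if prev and prev.kind == s.kind and s.start - prev.end < max_gap:
            # Assumption: the merged segment absorbs the intervening residues
            # and its weight is re-averaged over the whole merged span.
            span = s.end - prev.start
            mean = sum(conf[prev.start:s.end]) / span
            merged[-1] = Segment(prev.kind, prev.start, s.end, mean)
        else:
            merged.append(s)
    return merged
```

With a = 6, two confident helices separated by a two-residue coil would be reported as a
single helix segment, while a confident strand that follows them stays separate.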



¹ The quality of the alignment constructed using the Smith-Waterman dynamic program was not significantly
different from that obtained using BLASTP.

5.2 Evaluating the Performance of Our Heuristic

We evaluate the performance of our heuristic based on (i) the accuracy of its msa in
comparison with a reference alignment, and (ii) its computational efficiency for the appropriate
choice of its parameters a, H and c.

Evaluating accuracy of an msa: The traditional sequence similarity based measures
like SP score and Tree score have only been helpful in providing a crude estimate of the
alignment quality, and measures based on structurally correct alignments are likely to be
better alternatives for evaluating alignment accuracy. So, for sequences whose 3D
structure is known, the accuracy of an msa can be evaluated in comparison with reference
alignments constructed through a structure aligner. We also observe instances of homologous
sequences that share only a few features and yet preserve their overall structure and
function. In these instances, local feature conservation is another good predictor of
alignment accuracy. So, we measure the accuracy of the msas constructed by our heuristic in
terms of the percentage correlation between the columns in the multiple sequence alignments
constructed by our heuristic and the columns of the sites within the reference alignment that
correspond to conserved features.
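
One plausible reading of this measure, sketched below, is the percentage of reference
alignment columns at conserved feature sites that are reproduced exactly in the alignment
under test. This is an illustrative assumption rather than the exact measure used in our
experiments. The function names and the representation of conserved sites as a list of
reference column indices are hypothetical.

```python
from typing import List, Sequence, Tuple


def columns(msa: Sequence[str]) -> List[Tuple[int, ...]]:
    """Describe each column by the residue index it holds in every sequence
    (-1 marks a gap), so columns can be compared across alignments."""
    counters = [0] * len(msa)
    cols = []
    for j in range(len(msa[0])):
        col = []
        for i, row in enumerate(msa):
            if row[j] == "-":
                col.append(-1)
            else:
                col.append(counters[i])
                counters[i] += 1
        cols.append(tuple(col))
    return cols


def conserved_column_score(test: Sequence[str], reference: Sequence[str],
                           conserved: Sequence[int]) -> float:
    """Percentage of the reference columns listed in `conserved` that also
    occur, residue for residue, as a column of the test alignment."""
    test_cols = set(columns(test))
    ref_cols = columns(reference)
    chosen = [ref_cols[j] for j in conserved]
    if not chosen:
        return 0.0
    hits = sum(1 for c in chosen if c in test_cols)
    return 100.0 * hits / len(chosen)
```

Both alignments must contain the same sequences in the same order. Restricting the
comparison to conserved sites means that disagreements in poorly conserved regions do not
lower the score.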

Note: Our heuristics make use of the secondary structure predictions from PSIPRED. So,
any inaccuracies in the secondary structure prediction of PSIPRED should also be factored in
while evaluating the accuracy of the msa constructed by our heuristics. We account for this through
the correlation between the informative sites in our heuristic and the sites in the reference
alignment that correspond to conserved features. We also restrict the impact of inaccuracies
in secondary structure prediction on msa accuracy by a conservative choice of the information
threshold function c (i.e. higher than if we had an accurate partitioning and correct
classification of segments).

Evaluation of Computational Efficiency: Our heuristics attempt to minimize their
computational time with minimal impact on accuracy by first classifying the segments within
each sequence into informative (non-informative) segments based on their weight exceeding
(not exceeding) a. Then, the msa is essentially constructed by first progressively pair-wise
aligning the sites in informative segments using exact methods and then using linear time
approximate heuristics to align the sites in non-informative segments and stitch them back
into the alignment of sites in informative segments. So, the saving in computational time
depends on the fraction of the segments that are informative. This in turn depends mainly
on the choice of the information threshold a.
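
To illustrate how the choice of a drives this fraction, the sketch below computes the share
of residues that fall into informative segments for a given threshold. It is a simplified
illustration under stated assumptions: the input is a per-residue state string with
per-residue confidences, the merging of nearby segments is omitted, and the function name
informative_fraction is hypothetical.

```python
from itertools import groupby
from typing import List


def informative_fraction(states: str, conf: List[float], a: float,
                         min_len: int = 5) -> float:
    """Fraction of residues in segments whose average weight exceeds a
    and whose length is at least min_len (no merging of segments)."""
    if not states:
        return 0.0
    covered, pos = 0, 0
    for _, run in groupby(states):
        n = len(list(run))
        mean = sum(conf[pos:pos + n]) / n
        if mean > a and n >= min_len:
            covered += n
        pos += n
    return covered / len(states)
```

Raising a shrinks the part of each sequence that is aligned with exact methods and so
reduces running time, at the risk of discarding segments that carry useful signal.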



5.3 Summary of Preliminary Experimental Results 



Protein   # of        Sequence   % Sequence   # of informative   Avg Length of         % Local
Family    Sequences   Lengths    Similarity   segments           informative segment   Similarity

PF13420   21          152-164    20%-70%      4                  10                    < 70% a
PF13652   11          131-152    65%-85%      4                  12                    > 90%
PF13693   22          77-81      55%-83%      3                  12                    > 90%
PF13733   5           133-142    55%-61%      2                  8                     > 90%
PF13844   6           449-481    68%-78%      7                  12                    > 90%
PF13856   23          90-112     30%-73%      3                  10                    > 80% a
PF13944   21          120-146    30%-85%      3                  10                    > 90%
PF14186   11          152-157    38%-68%      4                  8                     > 90%
PF14263   10          120-129    50%-66%      3                  10                    > 90%
PF14274   20          155-165    36%-71%      3                  12                    > 90%
PF14323   18          485-548    36%-43%      6                  11                    > 90%

a: Quadratic Consistency and Max Gap Penalty Scoring Schemes were employed.



Table 1: Summary of msa results using Linear Consistency and Max Gap Penalty Schemes 

6. Conclusions and Future Work 

Our preliminary experimental results indicate that our template based heuristic framework
can help in designing heuristics that can exploit template based information to construct
msas that are biologically accurate in a computationally efficient manner. However, we
would like to (i) make use of the extreme value distribution [16] to define the function c that
maps a given level of sequence divergence to the information threshold for an alignment
to be informative; (ii) understand how to define the segment scoring schemes for aligning
sequences that are highly divergent; and (iii) evaluate the accuracy of the alignments constructed
by our heuristics by using sequence independent measures [2, 25, 34] on challenging datasets
in BAliBASE [27, 28].